This post showcases the capabilities of the R package distantia
to load, explore, and analyze epidemiological time series.
This tutorial requires the following packages:

- distantia: time series analysis via dynamic time warping.
- dtw: dynamic time warping library.
- future: parallelized execution.
- zoo: management of regular and irregular time series.
- dplyr: data frame manipulation.
- mapview: easy-to-use interactive map visualization.
- reactable: interactive tables.
- dygraphs: interactive time series plots.

library(distantia)
library(dtw)
library(zoo)
library(dplyr)
library(mapview)
library(reactable)
library(dygraphs)
This demo focuses on two example data frames shipped with the package
distantia: covid_counties and
covid_prevalence.
## covid_counties

Stored as a simple features data frame, covid_counties contains county polygons and several socioeconomic variables. It is connected to covid_prevalence by county name, which is stored in the column name.
The socioeconomic variables available in covid_counties
are:
- area_hectares: county surface in hectares.
- population: county population.
- poverty_percentage: percentage of the population below the poverty line.
- median_income: median county income in dollars.
- domestic_product: yearly domestic product in billions (not a typo) of dollars.
- daily_miles_traveled: daily miles traveled by the average inhabitant.
- employed_percentage: percentage of the county population under employment.

Please keep in mind that these variables were included in the dataset because they were easy to capture from online sources, not because of their importance for epidemiological analyses.
## covid_prevalence

This data frame contains weekly Covid-19 prevalence in 36 California counties between 2020-03-16 and 2023-12-18. It is derived from a daily prevalence dataset available here.
The prevalence time series has the columns name, time (the date of the first day of the week each data point represents), and prevalence (the proportion of positive tests). The table below shows the first rows of this data frame.
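The weekly series is derived from a daily dataset, and the aggregation step is not shown in the post. A plausible base-R sketch of that step looks like this (column names and values are made up for illustration, not taken from the source data):

```r
# Toy daily series: four weeks of synthetic prevalence values
set.seed(1)
daily <- data.frame(
  date = seq(as.Date("2020-03-16"), by = "day", length.out = 28),
  prevalence = runif(28, min = 0, max = 0.05)
)

# Label each day with the Monday starting its week
daily$week <- as.Date(cut(daily$date, breaks = "week", start.on.monday = TRUE))

# Collapse to one row per week, averaging the daily values
weekly <- aggregate(prevalence ~ week, data = daily, FUN = mean)
head(weekly)
```

Any weekly summary (mean, max, last value) could be used here; the real dataset's aggregation rule is not documented in this post.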
This post focuses on the Covid-19 dataset described above to showcase the applications of distantia to the analysis of epidemiological time series.

DISCLAIMER: I am not an epidemiologist, so I will refrain from interpreting any results and will focus instead on technical details about the usage of distantia.
This section shows how to transform the data frame
covid_prevalence into a format compatible with
distantia.
A time series list, or tsl for short, is a list of zoo time series representing unique realizations of the same phenomenon observed at different sites, times, or individuals. All data manipulation and analysis functions in distantia are designed to be applied to all time series in a tsl at once.
The function tsl_initialize() transforms time series
stored as long data frames into time series lists.
tsl <- distantia::tsl_initialize(
x = covid_prevalence,
name_column = "name",
time_column = "time"
)
Each element in tsl is named after the county the data
belongs to.
names(tsl)
## [1] "Alameda" "Butte" "Contra_Costa" "El_Dorado"
## [5] "Fresno" "Humboldt" "Imperial" "Kern"
## [9] "Kings" "Los_Angeles" "Madera" "Marin"
## [13] "Merced" "Monterey" "Napa" "Orange"
## [17] "Placer" "Riverside" "Sacramento" "San_Bernardino"
## [21] "San_Diego" "San_Francisco" "San_Joaquin" "San_Luis_Obispo"
## [25] "San_Mateo" "Santa_Barbara" "Santa_Clara" "Santa_Cruz"
## [29] "Shasta" "Solano" "Sonoma" "Stanislaus"
## [33] "Sutter" "Tulare" "Ventura" "Yolo"
The zoo objects within tsl also have the
attribute name to help track the data in case it is
extracted from the time series list.
attributes(tsl[[1]])$name
## [1] "Alameda"
attributes(tsl[[2]])$name
## [1] "Butte"
Each individual zoo object comprises a time index and a
data matrix.
zoo::index(tsl[["Alameda"]]) |>
head()
## [1] "2020-03-16" "2020-03-23" "2020-03-30" "2020-04-06" "2020-04-13"
## [6] "2020-04-20"
zoo::coredata(tsl[["Alameda"]]) |>
head()
## prevalence
## 1 0.01
## 2 0.01
## 3 0.01
## 4 0.01
## 5 0.01
## 6 0.02
This section describes the tools in distantia that may help develop an intuition about the properties of the data at hand, either via visualization or descriptive analysis.
The function tsl_plot() provides a static multipanel
visualization of a large number of time series at once.
distantia::tsl_plot(
tsl = tsl,
columns = 3,
guide = FALSE,
text_cex = 1.2
)
Combined with tsl_subset(), it can help focus on particular counties and zoom in on specific time periods.
distantia::tsl_plot(
tsl = tsl_subset(
tsl = tsl,
names = c("Los_Angeles", "Kings"),
time = c("2021-09-01", "2022-01-31")
),
guide = FALSE
)
The individual zoo objects in tsl can be
plotted right away using plot() or
distantia::zoo_plot().
distantia::zoo_plot(
x = tsl$Los_Angeles
)
Another good option to plot zoo objects comes with the package dygraphs.
The function dygraphs::dygraph() produces an interactive
visualization that helps localize specific events in time and compare
several time series at once.
dygraphs::dygraph(
data = cbind(tsl$Los_Angeles, tsl$Kings),
ylab = "Covid-19 Prevalence"
)
Time series have at least two dimensions: time, and one or more variables of interest, such as prevalence in this case.
Having a good understanding of the time features of a time series is
important, because some data management decisions depend on it. Details
like length, resolution, time units, or regularity may define how we
design an analysis. In distantia, the function
tsl_time() provides all relevant details about the time in
your time series list.
df_time <- distantia::tsl_time(
tsl = tsl
)
The table below shows the first five rows of df_time,
with the name of the time series, the class of the time index in each
zoo object, the units, length and resolution, and their time ranges.
Here, all time series are regular and have the same time resolution and range, which makes the table above rightfully boring.
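The same time features can also be recovered from a date index with base R alone. The sketch below uses a synthetic weekly index spanning the covid data range, not the actual output of tsl_time():

```r
# Synthetic weekly index covering the same range as the covid data
idx <- seq(as.Date("2020-03-16"), as.Date("2023-12-18"), by = "week")

# Resolution: gap between consecutive observations, in days
res <- as.numeric(diff(idx))
unique(res)  # 7, i.e. weekly

# The series is regular when every gap is identical
is_regular <- length(unique(res)) == 1

length(idx)  # 197 observations
```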
Once acquainted with time, having a summary of the values in each
time series also helps make sense of the main properties of the data.
The function tsl_stats() summarizes several important
statistical properties of the time series in tsl.
df_stats <- distantia::tsl_stats(
tsl = tsl,
lags = 1:6 #weeks
)
The first part of the output corresponds to common statistical descriptors, such as centrality and dispersion metrics, and higher moments.
The second part of the output, produced by the lags argument, shows the time series autocorrelation (Moran's I) at different time lags. Since the resolution of all time series is 7 days per data point, each lag represents one week. For example, a value of 0.9 at lag 1 indicates a strong correlation between the prevalence of each data point and that of its immediate predecessor.
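To build intuition for what a lag-1 value means, the base-R sketch below computes plain lag-1 autocorrelation on a synthetic smooth series. Note that tsl_stats() reports Moran's I, which for a regular time series behaves much like the ordinary autocorrelation shown here:

```r
# Smooth synthetic series: a slow wave plus mild noise
set.seed(42)
x <- sin(seq(0, 4 * pi, length.out = 200)) + rnorm(200, sd = 0.1)

# Lag-1 autocorrelation via acf()
r_acf <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Hand-rolled equivalent: correlate the series with itself shifted by one step
r_hand <- cor(x[-1], x[-length(x)])

round(c(r_acf, r_hand), 2)  # both close to 1 for a smooth series
```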
The stats produced by tsl_stats() can be joined to the
spatial data frame covid_counties using their common column
name to link cases.
covid_counties <- dplyr::inner_join(
x = covid_counties,
y = df_stats,
by = "name"
)
This join facilitates mapping any of the descriptive stats. For example, the map below shows the third quartile (q3) of prevalence per county, a robust proxy for its maximum across the complete time period.
mapview::mapview(
covid_counties,
zcol = "q3",
layer.name = "Prevalence (q3)",
label = "name",
col.regions = grDevices::hcl.colors(n = 5, palette = "Zissou 1")
)
The functions tsl_subset() and tsl_stats() can be combined to generate stats focused on a given period of time or sub-group of time series. As an example, the code below generates the stats for the year 2021.
df_stats_2021 <- distantia::tsl_stats(
tsl = tsl_subset(
tsl = tsl,
time = c("2021-01-01", "2021-12-31")
),
lags = 1:6
)
The package distantia is specifically designed to compare time series, and provides two general methods to do so: Lock-Step (LS) and Dynamic Time Warping (DTW). This article provides a detailed guide on how to apply these methods with distantia.
The LS method directly compares samples observed at the same time in time series of the same length. It is computationally efficient and easy to interpret but cannot account for differences in outbreak speed or timing, potentially missing underlying similarities when outbreaks are not synchronized.
DTW warps the time axis of the compared time series to maximize shape similarity. It adjusts for differences in the speed and timing of outbreaks, such as epidemic waves peaking at different times in different regions, and works well with time series of varying lengths or temporal resolutions. However, DTW is computationally expensive for large data sets, can be harder to interpret, and is sensitive to data scaling.
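The contrast between the two methods can be reproduced with a minimal base-R sketch: two identical outbreak-shaped curves, one delayed in time. The DTW implementation below is the classic dynamic-programming recursion, not distantia's own, and the series are synthetic:

```r
# Two identical 'outbreak' curves, the second delayed in time
t <- seq(0, 2 * pi, length.out = 60)
a <- pmax(sin(t), 0)           # wave peaking early
b <- pmax(sin(t - pi / 2), 0)  # same wave, shifted later

# Lock-step: distances between samples at the same positions
d_ls <- sum(abs(a - b))

# Minimal DTW: cumulative cost matrix built with the standard recursion
dtw_dist <- function(x, y) {
  n <- length(x)
  m <- length(y)
  D <- matrix(Inf, n + 1, m + 1)
  D[1, 1] <- 0
  for (i in 1:n) {
    for (j in 1:m) {
      cost <- abs(x[i] - y[j])
      D[i + 1, j + 1] <- cost + min(D[i, j + 1], D[i + 1, j], D[i, j])
    }
  }
  D[n + 1, m + 1]
}
d_dtw <- dtw_dist(a, b)

# Warping absorbs the delay: the DTW distance is far smaller than lock-step
c(lock_step = d_ls, dtw = d_dtw)
```

The lock-step distance is dominated by the time shift, while DTW matches the two humps against each other and reports them as nearly identical.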
Before going forward, let me show a little example to help develop a sense of how DTW and LS work on a small subset of our time series, stored here in the time series list tsl_smol. The figure below shows two years of data for San Francisco, Napa, and Solano.
At first glance, Napa and Solano seem quite well synchronized and look
very similar. On the other hand, San Francisco, in spite of having the
same number of events as the other two, shows a timing difference of
several months.
The table below shows the LS comparison between
these time series. The psi column indicates the
dissimilarity between pairs of time series.
tsl_smol |>
distantia::distantia_ls() |>
dplyr::mutate(
psi = round(psi, 3)
) |>
reactable::reactable(
resizable = TRUE,
striped = TRUE,
compact = TRUE,
wrap = FALSE,
fullWidth = FALSE
)
As expected, LS captures the synchronicity between Napa and Solano quite well, returning a low psi score that indicates high similarity. Meanwhile, San Francisco shows a high psi score when compared with the other two time series, indicating dissimilarity due to its asynchronous pattern. So far so good!
Now, let’s take a look at the DTW result:
tsl_smol |>
distantia::distantia_dtw() |>
dplyr::mutate(
psi = round(psi, 3)
) |>
reactable::reactable(
resizable = TRUE,
striped = TRUE,
compact = TRUE,
wrap = FALSE,
fullWidth = FALSE
)
Surprising, right? When cancelling out the time shift between time series, it turns out that San Francisco is more similar to Solano than Napa!
This happens because DTW ignores the dates of the observations, and can link one sample of one time series with one or several samples of the other time series. The plot below, made with dtw::dtw() and dtw::dtwPlotTwoWay(), shows that the first sample of San Francisco (the first case of the dashed red line), with prevalence 0, is linked to all samples with prevalence 0 at the beginning of the Napa series. Since the distance of all these links is 0, this operation cancels out the difference in timing between both time series.
These DTW results indicate that, for the given period of time, San Francisco, Napa, and Solano might be parts of the same system, in spite of the time-delay between them. This is a conclusion that cannot be achieved when applying the lock-step method!
It is for this reason that DTW is preferable when outbreak timing varies significantly and shape similarity is more important than exact time alignment. On the other hand, LS is best applied when the question at hand is focused on time-series synchronization rather than shape similarity.
## Computing Dissimilarity with distantia
Now that the features of DTW and LS are clear, we are going to compute dissimilarity scores with both methods for the complete dataset.
# lock-step
psi_ls <- distantia::distantia_ls(tsl = tsl)

# dynamic time warping
psi_dtw <- distantia::distantia_dtw(tsl = tsl)
The p-value represents the probability of finding a lower psi (higher similarity) when the time series are permuted. Low p-values indicate a strong similarity, better than expected by chance.
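The logic of such a permutation test can be sketched in base R with synthetic data. This is an unrestricted shuffle applied to a lock-step distance; the actual null model used by distantia may constrain permutations differently:

```r
set.seed(123)
# Two synthetic series sharing a common signal
signal <- sin(seq(0, 4 * pi, length.out = 100))
x <- signal + rnorm(100, sd = 0.3)
y <- signal + rnorm(100, sd = 0.3)

# Observed lock-step distance
d_obs <- sum(abs(x - y))

# Null distribution: shuffle one series and recompute the distance
d_null <- replicate(999, sum(abs(x - sample(y))))

# p-value: fraction of permutations at least as similar as the observed pair
p_value <- (sum(d_null <= d_obs) + 1) / (999 + 1)
p_value  # small: the observed similarity beats chance
```

Because the two series share a signal, almost no shuffled version of y comes as close to x as the original does, so the p-value sits near its lower bound.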